An instructive example using Stan…

1. Modeling HW with Data from the Current Population Survey (CPS)

Setting up and stuff…

1.1 Prior Predictive Distribution: A Simple Model

Fig. 1: Hourly Wages Model

1.2 Drawing from the Prior Predictive Distribution

rstan::expose_stan_functions("quantile_functions.stan")
source("GLD_helpers.R")
summary(cps$rw)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   1.079  10.794  16.299  20.056  25.690 188.898   13322

NB I am recoding the “wbho” variable (Race) to be a Boolean (White/Non-White) 1/0.

Now, using the theoretical model above - where I think that one’s gender, race, marrital status, and age (in decades) have a bearing on one’s hourly wages, plus some intercept and an error term - let’s construct some prior predictive distribution…

set.seed(42)
# y = m*x + c
# y = alpha + beta*x

# b1/Female = cps$female, b2/Race = cps$wbho, b3/Married = cps$married, b4/Age = cps$age

# Setting up my priors...
# NB Median should be about equidistant between the lower and upper quartiles.
(a_s1 <- GLD_solver(lower_quartile = -0.3, median = -0.2, upper_quartile = -0.1, # GLD priors for coefficients.
                    other_quantile = 0.2, alpha = 0.9))
## asymmetry steepness 
## 0.0000000 0.9653298
(a_s2 <- GLD_solver(lower_quartile = -0.01, median = 0, upper_quartile = 0.01,
                    other_quantile = 0.03, alpha = 0.9))
## asymmetry steepness 
## 0.0000000 0.9283763
(a_s3 <- GLD_solver(lower_quartile = 0.3, median = 0.5, upper_quartile = 0.65, 
                    other_quantile = 0.85, alpha = 0.9))
##  asymmetry  steepness 
## -0.4226581  0.9033965
(a_s4 <- GLD_solver(lower_quartile = 0, median = 0.2, upper_quartile = 0.3, 
                    other_quantile = 0.5, alpha = 0.9))
##  asymmetry  steepness 
## -0.6771304  0.9722460
m_alpha <- log(16); s_alpha <- 0.1 # Normal prior for alpha. At average variable values (x1, x2, x3, x4)...
q_sigma <- c(lower = 0.25, median = 1, upper = 2)

d <- model.matrix(log(rw) ~ female + wbho + married + age_dec, # Creating a centered matrix with our data.
                  data = cps)[ , -1] # To drop the intercept.
d <- sweep(d, MARGIN = 2, STATS = colMeans(d), FUN = `-`) # Centering the columns.

ppd <- replicate(1000, {
  alpha_ <- rnorm(1, mean = m_alpha, sd = s_alpha)
  sigma_ <- JQPDS_rng(lower_bound = 0, alpha = 0.25, quantiles = q_sigma) # Semi-bounded for sigma.
  beta1_ <- GLD_rng(median = -0.2, IQR = -0.1 - (-0.3), asymmetry = a_s1[1], steepness = a_s1[2]) # IQR = upper - lower.
  beta2_ <- GLD_rng(median = 0, IQR = 0.01 - (-0.01), asymmetry = a_s2[1], steepness = a_s2[2])
  beta3_ <- GLD_rng(median = 0.5, IQR = 0.65 - 0.3, asymmetry = a_s3[1], steepness = a_s3[2])
  beta4_ <- GLD_rng(median = 0.2, IQR = 0.3 - 0, asymmetry = a_s4[1], steepness = a_s4[2])
  mu_ <- alpha_ + beta1_ * d[ , "female"] + beta2_ * d[ , "wbho"] + beta3_ * d[, "married"] + beta4_ * I(d[, "age_dec"] ^ 2)
  epsilon_ <- rnorm(n = length(mu_), mean = 0, sd = sigma_) # Some error term.
  y_ <- mu_ + epsilon_ # y_ has a normal distribution with expectation mu_.
  y_
})

round(quantile(c(exp(ppd)), probs = 1:9 / 10), digits = 3)
##     10%     20%     30%     40%     50%     60%     70%     80%     90% 
##   0.796   4.699   9.661  14.476  19.478  25.829  37.660  69.993 248.865

This gives an unreasonable distribution at the ninth and first deciles. (Too high and too low, respectively.) We would need to either change the priors and/or include more variables in the model.

1.3 Drawing from the Posterior Distribution

post <- stan_lm(log(rw) ~ female + wbho + married + I(age_dec ^ 2), data = cps,
                prior_intercept = normal(location = m_alpha, scale = s_alpha, autoscale = FALSE), # location from m_alpha.
                prior = R2(location = median(r_squared), what = "median"), adapt_delta = 0.99) # location obtained from my R^2 above.

print(post, digits = 4)
## stan_lm
##  family:       gaussian [identity]
##  formula:      log(rw) ~ female + wbho + married + I(age_dec^2)
##  observations: 14799
##  predictors:   5
## ------
##              Median  MAD_SD 
## (Intercept)   2.8281  0.0169
## female       -0.3055  0.0089
## wbho         -0.1041  0.0117
## married       0.2311  0.0093
## I(age_dec^2)  0.0077  0.0005
## 
## Auxiliary parameter(s):
##               Median MAD_SD
## R2            0.1483 0.0054
## log-fit_ratio 0.0000 0.0057
## sigma         0.5493 0.0032
## 
## ------
## * For help interpreting the printed output see ?print.stanreg
## * For info on the priors used see ?prior_summary.stanreg
loo(post) # Pareto smoothed importance-sampling leave-one-out cross-validation (PSIS-LOO).
## 
## Computed from 4000 by 14799 log-likelihood matrix
## 
##          Estimate    SE
## elpd_loo -12138.9  98.6
## p_loo         6.6   0.2
## looic     24277.7 197.2
## ------
## Monte Carlo SE of elpd_loo is 0.0.
## 
## All Pareto k estimates are good (k < 0.5).
## See help('pareto-k-diagnostic') for details.
pp_check(post, plotfun = "loo_pit_overlay")
## Running PSIS to compute weights...

We see that the model - as is - is decidedly lacking in the lower tail though otherwise seems to fit reasonably.

1.4 Interpreting the Posterior Coefficients

##   (Intercept)        female          wbho       married       age_dec 
##          TRUE         FALSE         FALSE          TRUE          TRUE 
##         sigma log-fit_ratio            R2 
##          TRUE         FALSE          TRUE

Somewhat predictably different, perhaps, in the direction but not so in the magnitude, one might say.

1.5 Model Diagnostics

library(bayesplot)
mcmc_areas_ridges(post, regex_pars = "^[^(]") # Excluding (Intercept).

# I'm interested in the variables and mean PPD...
pairs(post, regex_pars = "^[^(lRs]") # Exclude parameters that start with (, l, R, or s. This takes care of (Intercept), sigma, log-fit_ratio, R2, and log-posterior.

While the model originally yielded a few divergent transitions below the diagonal, indicating - empirically - that it’s having a hard time dealing with the tails - though we knew that much from the overlaid density, in which case it may have been tempting to consider changing our model to simplify the geometry of the posterior distribution - it is somewhat oblique having to deal with divergent transitions below the diagonal. By setting my target acceptance rate at .999 - the default adapt_delta parameter being .8 - I have accomplished to take care of that particular problem without having to amend the model or my priors.